
Conversation

@riverlijunjie
Contributor

@riverlijunjie riverlijunjie commented Oct 24, 2025

Details:

  • Qwen3 MoE model support for fused weight compression
  • MoE transformation pipeline: FuseVectorizedMOE3GEMM -> ConvertMOEToMOECompressed -> FuseMOECompressed
  • ov::intel_gpu::op::MOEFusedCompressed fuses softmax_topk/onehot into the MoE computation for performance optimization
  • the prefill stage leverages a GEMM kernel to compute each expert's output one by one
  • the decode stage leverages OCL kernels to compute expert outputs in parallel
  • moe exec graph: (execution-graph image attached)
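The fused routing that MOEFusedCompressed absorbs into the MoE op can be sketched in plain C++. This is an illustrative sketch only, assuming standard softmax + top-k routing with renormalization; `route_token` and all names here are hypothetical, not the plugin API:

```cpp
// Illustrative sketch of softmax_topk routing: softmax over router logits,
// top-k expert selection, and renormalization of the selected weights.
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

std::vector<std::pair<int, float>> route_token(const std::vector<float>& logits, int top_k) {
    // Numerically stable softmax over the router logits.
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (auto& p : probs) p /= sum;

    // Select the top-k experts by probability.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    // Renormalize the selected probabilities so they sum to 1.
    float selected_sum = 0.f;
    for (int k = 0; k < top_k; ++k) selected_sum += probs[idx[k]];
    std::vector<std::pair<int, float>> result;
    for (int k = 0; k < top_k; ++k)
        result.emplace_back(idx[k], probs[idx[k]] / selected_sum);
    return result;
}
```

Fusing this routing into the MoE op avoids materializing intermediate softmax/one-hot tensors between kernels.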

Tickets:

@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Oct 24, 2025
@riverlijunjie riverlijunjie force-pushed the river/qwen3_moe_fused_compressed branch from f35b2cb to 4ccdcf1 Compare October 24, 2025 02:13
@github-actions github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Oct 26, 2025
@peterchen-intel peterchen-intel marked this pull request as ready for review October 29, 2025 01:15
@peterchen-intel peterchen-intel requested review from a team as code owners October 29, 2025 01:15
@peterchen-intel peterchen-intel requested review from mryzhov and removed request for a team October 29, 2025 01:15
@github-actions github-actions bot removed the category: transformations OpenVINO Runtime library - Transformations label Oct 29, 2025
@riverlijunjie riverlijunjie force-pushed the river/qwen3_moe_fused_compressed branch 2 times, most recently from faa6533 to 836d35c Compare October 30, 2025 04:18
@riverlijunjie riverlijunjie requested a review from a team as a code owner October 30, 2025 07:23
@github-actions github-actions bot added the category: Core OpenVINO Core (aka ngraph) label Oct 30, 2025
@peterchen-intel peterchen-intel changed the title [WIP][GPU]qwen3 moe fused compressed [GPU]qwen3 moe fused compressed Oct 30, 2025
@github-actions github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Oct 30, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds GPU support for Qwen3 MoE (Mixture of Experts) models with fused compressed weight optimization. The implementation introduces a transformation pipeline that converts standard MoE operations to compressed format and fuses routing operations (softmax/topk/onehot) into the MoE computation for improved performance.

Key Changes:

  • New transformation passes: FuseVectorizedMOE3GEMM -> ConvertMOEToMOECompressed -> FuseMOECompressed
  • Dual execution strategy: GEMM kernels for prefill stage, OCL kernels for decode stage
  • Memory optimization through weight compression and operation fusion

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

Show a summary per file
File - Description
transformations_pipeline.cpp - Registers the new MOE transformation passes in the GPU plugin pipeline
moe_opt.cpp/hpp - Implements optimized MOE execution with oneDNN and custom OCL kernels
moe_compressed.cpp/hpp - Defines the base MOECompressed operation with compressed weight configuration
moe_fused_compressed.cpp/hpp - Defines MOEFusedCompressed, which includes the fused routing operations
convert_moe_to_compressed.cpp/hpp - Transformation to convert standard MOE to compressed weight format
fuse_moe_compressed.cpp/hpp - Transformation to fuse the routing subgraph into the MOE operation
keep_moe_const_precision.cpp/hpp - Prevents precision conversion of compressed weights and zero points
moe_opt.cl, moe_mlp.cl - OpenCL kernels for softmax_topk, gather, scatter, and MLP operations
paged_attention_opt.cpp - Adds a workaround for an OCL resource issue with small input tokens
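The gather/scatter dispatch mentioned for moe_opt.cl can be sketched on the host side roughly as follows. This is purely illustrative, with scalar tokens standing in for hidden-state vectors; `moe_dispatch` and `ExpertFn` are hypothetical names, not the kernel interface:

```cpp
// Illustrative expert dispatch: for each expert, gather the tokens routed to
// it, run the expert computation, and scatter the outputs back into the
// result weighted by the routing probabilities.
#include <cassert>
#include <cmath>
#include <functional>
#include <utility>
#include <vector>

using ExpertFn = std::function<float(int expert, float x)>;

// tokens: one scalar "activation" per token (real code uses hidden vectors).
// routes[t]: {expert_id, weight} pairs chosen by top-k routing for token t.
std::vector<float> moe_dispatch(const std::vector<float>& tokens,
                                const std::vector<std::vector<std::pair<int, float>>>& routes,
                                int num_experts, const ExpertFn& expert) {
    std::vector<float> out(tokens.size(), 0.f);
    for (int e = 0; e < num_experts; ++e) {
        // Gather: visit the tokens routed to expert e.
        for (size_t t = 0; t < tokens.size(); ++t) {
            for (const auto& [eid, w] : routes[t]) {
                if (eid == e)
                    out[t] += w * expert(e, tokens[t]);  // Scatter weighted output back.
            }
        }
    }
    return out;
}
```

In the decode stage, the per-expert work in the outer loop is what the OCL kernels run in parallel.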
Comments suppressed due to low confidence (1)

src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe_opt.cpp:1

  • Remove commented-out unused code rather than leaving it in the codebase.
// Copyright (C) 2025 Intel Corporation


@riverlijunjie riverlijunjie force-pushed the river/qwen3_moe_fused_compressed branch from 0fd5af0 to 827a9f6 Compare October 31, 2025 05:26
@peterchen-intel peterchen-intel added the pr: needs tests (PR needs tests updating) and priority: high (High priority) labels and removed the do_not_merge label Oct 31, 2025
@riverlijunjie riverlijunjie force-pushed the river/qwen3_moe_fused_compressed branch from b0b0841 to 001bed4 Compare October 31, 2025 08:56
@chenhu-wang chenhu-wang force-pushed the river/qwen3_moe_fused_compressed branch from 358a015 to c824b67 Compare November 4, 2025 13:18
@github-actions github-actions bot added the category: Core OpenVINO Core (aka ngraph) label Nov 4, 2025
@chenhu-wang chenhu-wang force-pushed the river/qwen3_moe_fused_compressed branch from c824b67 to 79d6a13 Compare November 4, 2025 13:48
@github-actions github-actions bot removed the category: Core (OpenVINO Core, aka ngraph) and category: transformations (OpenVINO Runtime library - Transformations) labels Nov 4, 2025
/// shape [num_experts, hidden_size, group_num, 1]
/// 10: w2_zp - expert zp for final projection for compressed experts,
/// shape [num_experts, hidden_size, group_num, 1]
/// \param config Configuration for the MOE operation
Contributor


This description is for the 3gemm_Swiglu_type only. Please mention that.

Contributor Author


done!
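For reference, the 3gemm_Swiglu_type expert MLP this thread refers to follows the usual SwiGLU three-GEMM structure, which might be sketched as below. This is dense fp32 for clarity; `swiglu_mlp` is a hypothetical name, and the real op runs three GEMMs per expert over [num_experts, ...] weight tensors with group-wise scales and zero points:

```cpp
// Illustrative SwiGLU expert MLP: gate = W_gate*x, up = W_up*x,
// y = W_down * (silu(gate) elementwise-times up).
#include <cassert>
#include <cmath>
#include <vector>

float silu(float x) { return x / (1.f + std::exp(-x)); }

// hidden: input vector; w_gate/w_up: [inter_size][hidden_size];
// w_down: [hidden_size][inter_size].
std::vector<float> swiglu_mlp(const std::vector<float>& hidden,
                              const std::vector<std::vector<float>>& w_gate,
                              const std::vector<std::vector<float>>& w_up,
                              const std::vector<std::vector<float>>& w_down) {
    size_t inter = w_gate.size(), hid = hidden.size();
    std::vector<float> act(inter);
    for (size_t i = 0; i < inter; ++i) {
        float g = 0.f, u = 0.f;
        for (size_t j = 0; j < hid; ++j) {   // GEMM 1 and 2: gate and up projections
            g += w_gate[i][j] * hidden[j];
            u += w_up[i][j] * hidden[j];
        }
        act[i] = silu(g) * u;                // SwiGLU activation
    }
    std::vector<float> out(hid, 0.f);
    for (size_t j = 0; j < hid; ++j)         // GEMM 3: final (down) projection
        for (size_t i = 0; i < inter; ++i)
            out[j] += w_down[j][i] * act[i];
    return out;
}
```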

config.top_k = topk_shape.back();
config.out_type = ov::element::dynamic;
auto topk_shape = pattern_map.at(topk_m).get_partial_shape();
OPENVINO_ASSERT(topk_shape[1].is_static(), "k dimension in moe topk input should be static.");
Contributor


Please use OPENVINO_THROW for important checking

Contributor Author


done!
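The suggested pattern can be sketched in plain C++ as follows. `OV_CHECK` here is a hypothetical stand-in; the real code uses OpenVINO's OPENVINO_THROW/OPENVINO_ASSERT macros. The point of the suggestion is that important runtime checks should throw a catchable exception rather than rely on a check that may be compiled out or abort the process:

```cpp
// Illustrative throw-on-check macro modeled on the OPENVINO_THROW idea.
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

#define OV_CHECK(cond, msg)                                   \
    do {                                                      \
        if (!(cond)) {                                        \
            std::ostringstream oss;                           \
            oss << "Check '" << #cond << "' failed: " << msg; \
            throw std::runtime_error(oss.str());              \
        }                                                     \
    } while (0)

bool is_static_dim(int dim) { return dim >= 0; }  // -1 models a dynamic dimension

void validate_topk_dim(int k_dim) {
    OV_CHECK(is_static_dim(k_dim), "k dimension in moe topk input should be static.");
}
```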

class FuseMOECompressed: public ov::pass::MatcherPass {
public:
OPENVINO_MATCHER_PASS_RTTI("FuseMOECompressed");
FuseMOECompressed();
Contributor


This naming is also too general, since it targets only the 3-GEMM pattern. Please rename this too to reduce the confusion.

Contributor Author


done!

TEST(moe_compressed_gpu, moe_accuracy_test) {
auto& engine = get_test_engine();
if (!engine.get_device_info().supports_immad) {
std::cout << "not support immad, skip test" << std::endl;
Contributor


Please remove debug print.

Contributor Author


done!

struct moe_fused_compressed : public primitive_base<moe_fused_compressed> {
CLDNN_DECLARE_PRIMITIVE(moe_fused_compressed)

moe_fused_compressed() : primitive_base("", {}) {}
Contributor


Please modify the primitive name too, to reflect the specific target pattern.

Contributor Author


done!

namespace details {}

template <>
struct typed_program_node<moe_fused_compressed> : public typed_program_node_base<moe_fused_compressed> {
Contributor


The file name is too general: moe_inst.h
Please rename all the relevant primitives, test names, inst, and node names to
moe_fused_3gemm_swiglu

Contributor Author


done!

@yeonbok
Contributor

yeonbok commented Nov 4, 2025

I checked that there is no impact on gpt-oss. Only minor comments added.

@github-actions github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Nov 5, 2025
@yeonbok yeonbok enabled auto-merge November 5, 2025 03:16
@yeonbok yeonbok added this pull request to the merge queue Nov 5, 2025
@moslex moslex added this to the 2025.4 milestone Nov 5, 2025
Merged via the queue into openvinotoolkit:master with commit e61e47a Nov 5, 2025
221 of 223 checks passed

Labels

category: GPU (OpenVINO GPU plugin), category: transformations (OpenVINO Runtime library - Transformations), Code Freeze, priority: high (High priority)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants